Books Data Analysis Report
MTH208 - Data Science Lab - I
Instructor: Dr. Dootika Vats
Group 6: Raj Chandravanshi, Aman Sanwal, Kritika Pandey, Rhythm Agrawal
We would like to extend our heartfelt thanks to Professor Dootika Vats for her invaluable guidance and support throughout the development of our Book Data Analysis project. Her expert feedback, encouragement, and dedication have played a crucial role in the success of this project. We are deeply grateful for her patience and commitment in helping us overcome the challenges we encountered along the way.
This report provides an analysis of the top 500 ranked books on Goodreads. Our aim is to uncover patterns in reader preferences, genre popularity, author influence, and other trends based on a comprehensive dataset we obtained through web scraping.
The data for this analysis was scraped from Goodreads and includes the following features:
- Rank: Book ranking based on popularity and rating.
- Title: Book title.
- Author Name: Author(s) of the book.
- Average Rating: Average user rating of the book.
- Number of Ratings (Rater): Total number of ratings given to the book.
- Score: Goodreads score based on votes and ratings.
- Voters: Number of users who voted on the book’s score.
- First Published Date: Initial publication date of the book.
- Price: Price of the book.
- Author Followers: Number of followers of the author on Goodreads.
- Number of Reviews: Total reviews for the book.
- Top Genres: Top five genres associated with the book.
- Rating Distribution: Distribution of 1- to 5-star ratings.
- Cover Type: Format of the book cover (e.g., paperback, ebook).
- Author Average Rating: Average rating across all books by the author.
- Medium of Publication: Type of publication medium for each book listed.
We assume that every reader who rates a book has purchased it, either physically or digitally, so that the number of raters is proportional to the number of buyers. This overlooks pirated copies, so the number of raters may not be directly proportional to the number of buyers.
We also assume that raters have read the book they rate, so that rating counts are proportional to book sales. However, some people may rate a book without reading it, potentially influenced by others, which could skew the ratings.
Our dataset is biased toward popular, top-ranked books on Goodreads, excluding lower-ranked titles. To achieve a more balanced dataset, we could include books with a wider range of rankings, including less popular ones.
## [1] "Rank" "Title" "authorName"
## [4] "avg_rating" "rater" "score"
## [7] "voter" "price" "First_published"
## [10] "pages" "reviews" "followers"
## [13] "top_genre" "second_genre" "third_genre"
## [16] "fourth_genre" "fifth_genre" "Five_stars"
## [19] "Four_stars" "Three_stars" "Two_stars"
## [22] "One_stars" "cover_type" "author_avg_rating"
## [25] "medium_of_publication"
## Rank Title authorName avg_rating
## Min. : 1.0 Length:500 Length:500 Min. :3.330
## 1st Qu.:125.8 Class :character Class :character 1st Qu.:3.947
## Median :250.5 Mode :character Mode :character Median :4.110
## Mean :250.5 Mean :4.108
## 3rd Qu.:375.2 3rd Qu.:4.280
## Max. :500.0 Max. :4.810
##
## rater score voter price
## Min. : 1022 Min. : 23761 Min. : 252 Min. : 0.000
## 1st Qu.: 301825 1st Qu.: 36676 1st Qu.: 464 1st Qu.: 1.990
## Median : 543634 Median : 64096 Median : 783 Median : 8.235
## Mean : 934446 Mean : 210582 Mean : 2305 Mean : 7.671
## 3rd Qu.: 1016434 3rd Qu.: 158691 3rd Qu.: 1805 3rd Qu.:12.592
## Max. :10414224 Max. :3913869 Max. :39812 Max. :74.990
## NA's :200
## First_published pages reviews followers
## Length:500 Min. : 26.0 Min. : 52 Min. : 1
## Class :character 1st Qu.: 254.5 1st Qu.: 11642 1st Qu.: 5362
## Mode :character Median : 370.0 Median : 22048 Median : 20700
## Mean : 436.6 Mean : 36826 Mean : 81573
## 3rd Qu.: 509.5 3rd Qu.: 44941 3rd Qu.: 68700
## Max. :4100.0 Max. :295811 Max. :857000
## NA's :1
## top_genre second_genre third_genre fourth_genre
## Length:500 Length:500 Length:500 Length:500
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## fifth_genre Five_stars Four_stars Three_stars
## Length:500 Min. :19.00 Min. : 3.00 Min. : 2.0
## Class :character 1st Qu.:36.00 1st Qu.:29.00 1st Qu.:12.0
## Mode :character Median :42.00 Median :32.00 Median :16.0
## Mean :43.86 Mean :31.11 Mean :16.3
## 3rd Qu.:51.00 3rd Qu.:34.00 3rd Qu.:20.0
## Max. :90.00 Max. :42.00 Max. :33.0
##
## Two_stars One_stars cover_type author_avg_rating
## Min. :0.00 Min. :0.00 Length:500 Min. :0.000
## 1st Qu.:2.00 1st Qu.:1.00 Class :character 1st Qu.:3.920
## Median :4.00 Median :1.00 Mode :character Median :4.070
## Mean :3.96 Mean :1.88 Mean :4.059
## 3rd Qu.:5.00 3rd Qu.:2.00 3rd Qu.:4.210
## Max. :9.00 Max. :9.00 Max. :4.810
##
## medium_of_publication publishtion_year
## Length:500 Min. : 401
## Class :character 1st Qu.:1951
## Mode :character Median :1989
## Mean :1948
## 3rd Qu.:2006
## Max. :2023
##
## Rank Title authorName
## 0 0 0
## avg_rating rater score
## 0 0 0
## voter price First_published
## 0 200 0
## pages reviews followers
## 1 0 0
## top_genre second_genre third_genre
## 0 1 1
## fourth_genre fifth_genre Five_stars
## 1 1 0
## Four_stars Three_stars Two_stars
## 0 0 0
## One_stars cover_type author_avg_rating
## 0 0 0
## medium_of_publication publishtion_year
## 0 0
Conclusion:- Based on the summary statistics and the missing values analysis for the group_6 dataset, here are the key insights:
Summary Statistics Overview:
- The dataset covers 500 books with ranks from 1 to 500. Average ratings range from 3.33 to 4.81, with a median of 4.11, which indicates that the books are generally well rated.
- The rater, score, and voter columns span a wide range. The maximum values of rater and score reach into the millions (10,414,224 and 3,913,869), and voter reaches 39,812, indicating that some books are far more popular than the rest.
- Price ranges from 0 to 74.99 with a median of 8.235, so the cost of books varies considerably. This column has 200 missing entries.
- Page counts range from 26 to 4,100. The largest values occur because some ranked entries are sets of several books rather than a single volume. The median of 370 pages suggests that a typical book is moderately long.
- Review and follower counts have high maximums (295,811 reviews and 857,000 followers), indicating the presence of very popular books and authors in the dataset.
Missing Values Analysis:
- The price column has 200 missing values out of 500 (40%), which need to be handled before any price-based analysis.
- The pages column has 1 missing value, and the genre columns second_genre, third_genre, fourth_genre, and fifth_genre have 1 missing value each, which has little effect on genre-based analysis.
- All other features are complete, so apart from price the dataset has very few missing entries.
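The per-column missing value counts discussed above can be reproduced with a few lines of base R. The following is a minimal sketch on a small hypothetical data frame (`books` and its values are made up for illustration and are not the scraped data); it also shows one simple way to restrict a price-based analysis to rows where the price is known.

```r
# Hypothetical toy data standing in for the scraped dataset.
books <- data.frame(
  price      = c(8.99, NA, 12.50, NA, 0),
  pages      = c(320, 410, NA, 256, 180),
  avg_rating = c(4.1, 3.9, 4.3, 4.0, 4.2)
)

# Number of missing values in each column.
na_counts <- colSums(is.na(books))

# Percentage of missing values in each column.
na_share <- round(100 * colMeans(is.na(books)), 1)

print(na_counts)
print(na_share)

# Keep only the rows with a known price for price-based plots.
books_with_price <- books[!is.na(books$price), ]
print(nrow(books_with_price))
```

Dropping rows is only one option. Whether it is appropriate depends on why the prices are missing, which this sketch does not address.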
Conclusion: The density plot shows that most books have a high average rating, with a peak around 4 stars.
Conclusion: Fiction, Fantasy, and Classics are the most common genres among top-rated books.
Conclusion:- This plot shows a modest positive correlation (0.32) between the number of pages in a book and its average rating, suggesting that, to some extent, readers rate longer books more favorably than shorter ones.
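A correlation coefficient like the one quoted above can be computed with base R's `cor()`. The sketch below uses hypothetical vectors (`pages` and `avg_rating` are made-up values, not the scraped data) and assumes the Pearson coefficient, which the report does not state explicitly.

```r
# Hypothetical values for illustration only.
pages      <- c(120, 250, 310, 400, NA, 520, 640, 800)
avg_rating <- c(3.9, 4.1, 3.8, 4.2, 4.0, 4.0, 4.3, 4.4)

# Pearson correlation, ignoring pairs where either value is missing.
r <- cor(pages, avg_rating, use = "complete.obs", method = "pearson")

# Test of the null hypothesis that the true correlation is zero.
ct <- cor.test(pages, avg_rating)

# Straight-line fit of rating on page count; rows with NA are dropped.
fit <- lm(avg_rating ~ pages)

print(r)
print(ct$p.value)
print(coef(fit))
```

Reporting the p-value or a confidence interval alongside the coefficient helps judge whether a value of this size could be due to chance.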
Conclusion:- We see a very weak negative correlation between the number of pages and the number of raters (readers); the number of readers decreases only when the book size increases significantly.
Conclusion:- Although we might have expected the price of the book to increase with the number of pages, the data shows that this is not the case.
Conclusion:- One might have expected books with lower prices to have more ratings because they are more accessible, and more expensive books to have fewer. However, the number of ratings is almost uncorrelated with the price of the book, which may indicate that for these top-ranked books people are willing to pay even for an expensive book.
Conclusion:- The plot shows a transition in sales (number of raters/readers) from classics to fiction between the 1850s and the 1950s, and a further shift from fiction to the fantasy, young adult, and romance genres in the 21st century.
Conclusion:- The genres Graphic Novels, Picture Books, and Childrens have the highest average ratings. This may suggest that children rate books more highly than adults do, perhaps because adults assess things more critically.
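A genre comparison of this kind amounts to grouping books by genre and averaging their ratings. Below is a minimal base R sketch on hypothetical data (the rows are made up; only the column names `top_genre` and `avg_rating` come from the dataset). It also records how many books each genre contains, since an average over very few books is unreliable.

```r
# Hypothetical rows for illustration only.
books <- data.frame(
  top_genre  = c("Fantasy", "Classics", "Fantasy",
                 "Picture Books", "Classics", "Picture Books"),
  avg_rating = c(4.2, 3.9, 4.4, 4.5, 4.1, 4.3)
)

# Mean of the per-book average rating within each genre.
genre_summary <- aggregate(avg_rating ~ top_genre, data = books, FUN = mean)

# Number of books behind each mean.
genre_counts <- table(books$top_genre)
genre_summary$n_books <-
  as.integer(genre_counts[as.character(genre_summary$top_genre)])

# Sort genres from highest to lowest mean rating.
genre_summary <- genre_summary[order(-genre_summary$avg_rating), ]
print(genre_summary)
```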
Conclusion:- For a given price, the number of raters (people reading the book after purchasing it) is not affected by whether the medium of publication is digital or physical.
Conclusion:- One might have thought that, for a given price, people would rate physical books higher than digital books, since reading a physical book has a certain feel to it. However, this does not seem to be true: for a given price, the medium of publication does not affect the average rating much.
Conclusion:- Books with more recent publication dates, that is, those published after 2000, have a higher number of ratings and can therefore be assumed to have been read much more in recent years than older books.
11. Some Additional Visualisations
Conclusion:- The analysis shows that the majority of books in the dataset have a paperback cover type. This suggests that paperback editions are more prevalent, likely due to their affordability and wider availability compared to hardcover or digital formats.
Conclusion:- We can observe that the price does not influence the average rating received by the book. Therefore, our assumption that a higher price might lead to a lower average rating was incorrect across all genres.
This box plot shows the distribution of average rating across the top 5 genres.
12. Overall Conclusion
The analysis highlights several trends in book popularity, genre influence, author reputation, and the relationship between book characteristics like page count and ratings. While certain genres dominate the top ranks, the quality of authors and book content also plays a significant role in reader reception.
13. Challenges Faced
We frequently faced timeout issues: because of the complexity of the data structure, execution times exceeded 30 minutes. Slow R loops compounded the problem, as they could not process the data quickly enough, and the system timed out before the task completed. Despite several optimization attempts, the intricate data handling and slow loop execution continued to cause timeout errors, significantly affecting the performance and efficiency of our code.
Another challenge was the missing values in the ‘price’ column for several books. These gaps may have affected the accuracy of the visualizations that involve price-based analysis.
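One common way to reduce loop overhead in R is to replace element-by-element loops with vectorised operations. The sketch below is only an illustration of that idea, not the project's actual code: the input strings, the "k" suffix convention, and the function names `parse_loop` and `parse_vec` are all hypothetical.

```r
# Hypothetical raw strings, similar in shape to scraped counts.
raw_followers <- c("1,204", "857k", "20.7k", "68,700")

# Slow pattern: growing a vector with c() inside a loop
# copies the whole vector on every iteration.
parse_loop <- function(x) {
  out <- c()
  for (s in x) {
    mult <- if (grepl("k$", s)) 1000 else 1
    out <- c(out, as.numeric(gsub("[,k]", "", s)) * mult)
  }
  out
}

# Vectorised pattern: each step works on the whole vector at once.
parse_vec <- function(x) {
  mult <- ifelse(grepl("k$", x), 1000, 1)
  as.numeric(gsub("[,k]", "", x)) * mult
}

print(parse_loop(raw_followers))
print(parse_vec(raw_followers))
```

Both functions return the same values; the difference in speed only becomes noticeable on long inputs. If the timeouts came from network requests rather than from computation, vectorisation alone would not remove them.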